Advanced R Programming - Final Project
2026-02-08
Research Question 2: Is there a correlation between years taken to reach Unicorn status and current valuation?
Dataset: unicorn_companies.csv containing data on 1,074 companies.
Initial inspection revealed 1,037 rows and 13 columns.
Data Types: All columns were initially read as characters (<chr>).
Cleaning Needs:
Valuations: Contained “$” symbols.
Founded Year: Contained “None” strings.
Dates: Required parsing into date-time objects.
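The checks behind this list can be sketched on a small stand-in table. The rows below are invented for illustration (they are not taken from unicorn_companies.csv), and the object names `raw_toy`, `col_types`, and `issues` are hypothetical; only the column names follow the raw file.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the raw file; all values are invented for illustration
raw_toy <- tibble(
  Company          = c("A", "B", "C"),
  `Valuation ($B)` = c("$10", "$2", "$1"),
  `Date Joined`    = c("4/7/2017", "12/1/2012", "7/3/2018"),
  `Founded Year`   = c("2012", "None", "2015")
)

# First class of each column: everything is character at this stage
col_types <- vapply(raw_toy, function(x) class(x)[1], character(1))

# Count the problem values that the cleaning step has to handle
issues <- raw_toy |>
  summarise(
    n_dollar = sum(str_detect(`Valuation ($B)`, fixed("$"))),
    n_none   = sum(`Founded Year` == "None")
  )
```

On the real data, the same two counts would indicate how many valuation strings need the "$" stripped and how many founding years need to become NA before numeric conversion.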
A cleaning pipeline was implemented to prepare the data for analysis.
df_clean <- df |>
  rename(Valuation_B = `Valuation ($B)`,
         Date_Joined = `Date Joined`,
         Founded_Year = `Founded Year`) |>
  mutate(
    Valuation_B = as.numeric(str_remove(Valuation_B, "\\$")),
    Date_Joined = parse_date_time(Date_Joined, orders = c("mdy", "dmy", "ymd")),
    Founded_Year = as.numeric(na_if(as.character(Founded_Year), "None"))
  )
Missing founding years were filled using the industry-specific median to maintain data integrity.
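To make the imputation step concrete, here is a minimal sketch on invented data. The tibble `toy` and its values are hypothetical; the per-group `.by` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Invented example: two industries, each with one missing founding year
toy <- tibble(
  Industry     = c("Fintech", "Fintech", "Fintech", "AI", "AI"),
  Founded_Year = c(2010, NA, 2014, 2016, NA)
)

# Replace each NA with the median founding year of the same industry
imputed <- toy |>
  mutate(
    Founded_Year = if_else(is.na(Founded_Year),
                           median(Founded_Year, na.rm = TRUE),
                           Founded_Year),
    .by = Industry
  )
```

In this toy case the missing Fintech year becomes 2012 and the missing AI year becomes 2016. If an industry had no known founding years at all, its median would itself be NA and the gap would remain.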
The analysis identified clear trends in sector performance and geographic dominance.
Top Hubs: USA, China, and India account for the majority of the global unicorn population.
Fintech Dominance: Fintech has the highest count of unicorns and a high average valuation.
AI Efficiency: The Artificial Intelligence sector has the highest “Growth Rate” (valuation appreciation per year).
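A sector summary of this kind could be computed as sketched below. The data are invented, and the growth-rate formula (valuation divided by years to unicorn, averaged within each industry) is an assumption made for illustration; the report does not state the exact definition it used.

```r
library(dplyr)

# Invented example rows; column names follow df_clean
toy <- tibble(
  Industry         = c("Fintech", "Fintech", "Fintech", "AI", "AI"),
  Valuation_B      = c(10, 4, 1, 12, 6),
  Years_to_Unicorn = c(5, 4, 2, 2, 3)
)

sector_summary <- toy |>
  # Drop zero-year cases to avoid dividing by zero (assumption of this sketch)
  filter(Years_to_Unicorn > 0) |>
  summarise(
    n_companies   = n(),
    avg_valuation = mean(Valuation_B),
    # Assumed definition: $B of valuation per year taken to reach unicorn status
    growth_rate   = mean(Valuation_B / Years_to_Unicorn),
    .by = Industry
  ) |>
  arrange(desc(n_companies))
```

Sorting by `n_companies` surfaces the sector with the most unicorns, while `growth_rate` can be sorted separately to compare sectors on speed-adjusted valuation.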
# --- 1. SETUP & DATA CLEANING ---
library(tidyverse)
library(lubridate)
library(plotly)
library(DT)
library(patchwork)
library(scales)
# Load and clean data directly from the project logic
df <- read_csv("unicorn_companies.csv")
df_clean <- df |>
  rename(Valuation_B = `Valuation ($B)`,
         Date_Joined = `Date Joined`,
         Founded_Year = `Founded Year`) |>
  mutate(
    Valuation_B = as.numeric(str_remove(Valuation_B, "\\$")),
    Date_Joined = parse_date_time(Date_Joined, orders = c("mdy", "dmy", "ymd")),
    Founded_Year = as.numeric(na_if(as.character(Founded_Year), "None"))
  ) |>
  # Advanced Imputation using industry median (.by requirement)
  mutate(Founded_Year = if_else(is.na(Founded_Year),
                                median(Founded_Year, na.rm = TRUE),
                                Founded_Year), .by = Industry) |>
  mutate(Join_Year = year(Date_Joined),
         Years_to_Unicorn = Join_Year - Founded_Year,
         Is_US = if_else(Country == "United States", "US", "International")) |>
  filter(Years_to_Unicorn >= 0)
# --- 2. MULTI-PANEL DASHBOARD (GGPLOT2 & PATCHWORK) ---
# Plot A: Valuation vs. Speed to Unicorn
p1 <- ggplot(df_clean, aes(x = Years_to_Unicorn, y = Valuation_B, color = Industry)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  scale_y_log10(labels = label_dollar()) +
  labs(title = "Valuation vs. Scaling Speed", x = "Years to Unicorn", y = "Valuation ($B)") +
  theme_minimal() + theme(legend.position = "none")
# Plot B: Top 5 Industries by Company Count
p2 <- df_clean |>
  count(Industry) |>
  slice_max(n, n = 5) |>
  ggplot(aes(x = reorder(Industry, n), y = n, fill = Industry)) +
  geom_col() + coord_flip() +
  labs(title = "Top 5 Growth Industries", x = "", y = "Count") +
  theme_minimal() + theme(legend.position = "none")
# Plot C: Global Unicorn Growth Over Time
p3 <- df_clean |>
  count(Join_Year) |>
  ggplot(aes(x = Join_Year, y = n)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point() +
  labs(title = "Unicorn Emergence by Year", x = "Year Joined", y = "New Unicorns") +
  theme_minimal()
# Plot D: Valuation Spread (US vs. International)
p4 <- ggplot(df_clean, aes(x = Is_US, y = Valuation_B, fill = Is_US)) +
  geom_boxplot() +
  scale_y_log10(labels = label_dollar()) +
  labs(title = "Market Valuation: US vs. Intl", x = "", y = "Valuation ($B)") +
  theme_minimal() + theme(legend.position = "none")
# Merge into unified Dashboard
(p1 | p2) / (p3 | p4) + plot_annotation(title = "Global Unicorn Ecosystem Analysis")
# --- 3. GEOGRAPHIC DISTRIBUTION (FACETED HISTOGRAM) ---
p5 <- df_clean |>
  filter(Country %in% c("United States", "China", "India")) |>
  ggplot(aes(x = Valuation_B, fill = Country)) +
  geom_histogram(bins = 20, color = "white", boundary = 0) +
  facet_wrap(~Country, scales = "free_y") +
  scale_x_log10(labels = label_dollar()) +
  labs(title = "Valuation Distribution: Top 3 Hubs", x = "Valuation ($B)", y = "Count") +
  theme_minimal() + theme(legend.position = "none")
print(p5)
Hypothesis: “Blitz-scaling” (reaching $1B faster) leads to higher valuations.
Finding: A slight negative correlation exists between years to Unicorn status and valuation.
Quantitative Data:
Fast Track (<= 3 yrs): Higher average valuation.
Slow Track (> 10 yrs): Lower average valuation.
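One way to check a finding like this is sketched below on invented data. The choice of Spearman rank correlation is an assumption of this sketch (it is less sensitive to the heavy right skew of valuations); the report does not state which coefficient was used. The "Middle" label for companies between the two tracks is also hypothetical.

```r
library(dplyr)

# Invented example rows; column names follow df_clean
toy <- tibble(
  Years_to_Unicorn = c(1, 2, 3, 6, 8, 11, 12, 15),
  Valuation_B      = c(12, 3, 7, 5, 1, 4, 2, 1.5)
)

# Rank-based correlation between time to unicorn status and valuation
rho <- cor(toy$Years_to_Unicorn, toy$Valuation_B, method = "spearman")

# Average valuation per track, using the thresholds from the report
track_summary <- toy |>
  mutate(Track = case_when(
    Years_to_Unicorn <= 3 ~ "Fast Track",
    Years_to_Unicorn > 10 ~ "Slow Track",
    .default = "Middle"
  )) |>
  summarise(n = n(), avg_valuation = mean(Valuation_B), .by = Track)
```

Applied to `df_clean`, `cor.test()` with the same arguments would additionally report a p-value, which helps judge whether a slight correlation is distinguishable from noise.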
A multi-panel dashboard was constructed using patchwork to visualize the ecosystem.
Panel 1: Valuation vs. Years to Unicorn (Scatter).
Panel 2: Top 5 Industries by Count (Bar).
Panel 3: Unicorn Growth Over Time (Line).
Panel 4: Valuation Spread: US vs. International (Boxplot).
Geographic Hubs: USA dominates Software/Services; China leads in Hardware and AI.
Speed Matters: The “first-mover advantage” is supported by higher valuations for fast-scaling companies.
Survivor Bias: Dataset only includes companies that reached $1B.
Static Valuations: “Paper values” may not reflect current liquid value in volatile markets.